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Likelihood-free model choice 


1. INTRODUCTION 

As it is now hopefully clear from earlier chapters in this book, there exist 
several ways to set ABC methods hrmly within the Bayesian framework. The 
method has now gone a very long way from the “trick” of the mid 1990’s [24, 33], 
where the tolerance acceptance condition 

diy,y°^n<e 

was a crude practical answer to the impossibility to wait for the event d{y, = 
0 associated with exact simulations from the posterior distribution [29]. Not only 
do we now enjoy theoretical convergence guarantees [5, 6, 17] as the computing 
power grows to inhnity, but we also beneht from new results that set actual ABC 
implementations, with their hnite computing power and strictly positive toler¬ 
ances, within the range of other types of inference [38, 39, 40]. ABC now stands 
as an inference method that is justihable on its own ground. This approach may 
be the only solution available in complex settings such as those originally tackled 
in population genetics [24, 33], unless one engages into more perilous approxima¬ 
tions. The conclusion of this evolution towards mainstream Bayesian inference is 
quite comforting about the role ABC can play in future computational develop¬ 
ments, but this trend is far from delivering the method a blank conhdence check 
in that some implementations of it will alas fail to achieve consistent inference. 

Model choice is actually a fundamental illustration of how much ABC can 
err away from providing a proper inference when sufficient care is not properly 
taken. This issue is even more relevant when one considers that ABC is used a 
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lot—at least in population genetics—for the comparison and hence the valida¬ 
tion of scenarios that are constructed based on scientific hypotheses. The more 
obvious difficulty in ABC model choice is indeed conceptual rather than com¬ 
putational in that the choice of an inadequate vector of summary statistics may 
produce an inconsistent inference [28] about the model behind the data. Such an 
inconsistency cannot be overcome with more powerful computing tools. Existing 
solutions avoiding the selection process within a pool of summary statistics are 
limited to specific problems and difficult to calibrate. 

Past criticisms of ABC from the outside have been most virulent about this 
aspect, even though not always pertinent (see, e.g., [34, 35] for an extreme ex¬ 
ample). It is therefore paramount that the inference produced by an ABC model 
choice procedure be validated on the most general possible basis for the method to 
become universally accepted. As we discuss in this chapter, reflecting our evolv¬ 
ing perspective on the matter, there are two issues with the validation of ABC 
model choice: (a) is it not easy to select a good set of summary statistics (b) even 
selecting a collection of summary statistics that lead to a convergent Bayes factor 
may produce a poor approximation at the practical level. 

As a warning, we note here that this chapter does not provide a comprehensive 
survey of the literature on ABC model choice, neither about the foundations 
[see 18, 36] and more recent proposals [see 2, 3, 23], nor on the wide range of 
applications of the ABC model choice methodology to specific problems as in, 
e.g., [4, 10]. 

After introducing standard ABC model choice techniques, we discuss the curse 
of insufficiency. Then, we present the ABC random forest strategy for model 
choice and consider first a toy example and, at the end, a human population 
genetics example. 


2. SIMULATE ONLY SIMULATE 

The implementation of ABC model choice should not deviate from the original 
principle at the core of ABC, in that it proceeds by treating the unknown model 
index 501 as an extra parameter with an associated prior, in accordance with 
standard Bayesian analysis. An algorithmic representation associated with the 
choice of a summary statistic S{-) is thus as follows: 


Algorithm 1 standard ABC model choice 

for i = 1 to do 

Generate 9Jl from the prior TrfOJl) 

Generate 6 from the prior n^{9) 

Generate y from the model fdJiiy\9) 

Set an'*) = m, e'*' = e and s'*) = S{y) 

end for 

return the values Oil'*) associated with the k smallest distances ci(s'*), 


In this presentation of the algorithm, the calibration of the tolerance e for 
ABC model choice is expressed as a A;-nearest neighbours (fc-nn) step, following 
the validation of ABC in this format by [5], and the observation that the tolerance 
level is chosen this way in practice. Indeed, this standard strategy ensure a given 
number of accepted simulations is produced. While the fc-nn method can be used 
towards classihcation and hence model choice, we will take advantage of different 
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machine learning tools in Section 4. In general the accuracy of a fe-nn method 
heavily depends on the value of k, which must be calibrated, as illustrated in 
[25]. Indeed, while the primary justification of ABC methods is based on the 
ideal case when e ~ 0, hence k should be taken “as small as possible”, more 
advanced theoretical analyses of its non-parametric convergence properties led to 
conclude that e had to be chosen away from zero for a given sample size [5, 6, 17]. 
Rather than resorting to non-parametric approaches to the choice of k, which are 
based on asymptotic arguments, [25] rely on an empirical calibration of k using 
the whole simulated sample known as the reference table to derive the error rate 
as a function of k. 

Algorithm 1 thus returns a sample of model indices that serves as an approxi¬ 
mate sample from the posterior distribution 7r(S[)I|y°^®) and provides an estimated 
version via the observed frequencies. In fact, the posterior probabilities can be 
written as the following conditional expectations 

P(9II = m|5(Y) = s)= E(l|2„=m}|5(Y) = s). 

Computing these conditional expectation based on iid draws from the distribution 
of (iUt, ^(Y)) can be interpreted as a regression problem in which the response 
is the indicator of whether or not the simulation comes from model m and the 
covariates are the summary statistics. The iid draws constitute the reference 
table, which also is the training database for machine learning methods. The 
process used in the above ABC Algorithm 1 is a A:-nnmethod if one approximates 
the posterior by the frequency of m among the k nearest simulations to s. The 
proposals of [18] and [37] for ABC model choice are exactly in that vein. 

Other methods can be implemented to better estimate P(9JT = m|S'(Y) = s) 
from the reference table, the training database of the regression method. For 
instance, Nadaraya-Watson estimators are weighted averages of the responses, 
where weights are non-negative decreasing functions (or kernels) of the distance 
The regression method commonly used (instead of A:-nn) is a local re¬ 
gression method, with a multinomial link, as proposed by [16] or by [10]: local 
regression procedures fit a linear model on simulated pairs in a neigh¬ 

bourhood of s. The multinomial link ensures that the vector of probabilities has 
entries between 0 and 1 and sums to 1. However, local regression can prove com¬ 
putationally expensive, if not intractable, when the dimension of the covariate 
increases. Therefore, [14] proposed a dimension reduction technique based on lin¬ 
ear discriminant analysis (an exploratory data analysis technique that projects 
the observation cloud along axes that maximise the discrepancies between groups, 
see [19]), which produces to a summary statistic of dimension M — 1. 


Algorithm 2 local logistic regression ABC model choice 
Generate N samples (9Jt*'*\ s^®^) as in Algorithm 1 

Compute weights uii = where K \s a kernel density and [) is its bandwidth 

estimated from the sample (s^®^) 

Estimate the probabilities P(9JJ = m|s) by a logistic link based on the covariate s from the 
weighted data (QJt*'®\ s^®\ cui) 


Unfortunately, all regression procedures given so far suffer from a curse of 
dimensionality: they are sensitive to the number of covariates, i.e., the dimension 
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of the vector of summary statistics. Moreover, as detailed in the following sections, 
any improvements in the regression method do not change the fact that all these 
methods aim at approximating = m|5(Y) = s) as a function of s and use 
this function at s = while caution and cross-checking might be necessary to 
validate = m|5(Y) = as an approximation of = m|Y = 

A related approach worth mentioning here is the Expectation Propagation 
ABC (EP-ABC) algorithm of [3], which also produces an approximation of the 
evidence associated with each model under comparison. Without getting into de¬ 
tails, the expectation-propagation approach of [22, 30] approximates the posterior 
distribution by a member of an exponential family, using an iterative and fast 
moment-matching process that takes only a component of the likelihood product 
at a time. When the likelihood function is unavailable, [3] propose to instead rely 
on empirical moments based on simulations of those fractions of the data. The 
algorithm includes as a side product an estimate of the evidence associated with 
the model and the data, hence can be exploited for model selection and posterior 
probability approximation. On the positive side, the EP-ABC is much faster than 
a standard ABC scheme, does not always resort to summary statistics, or at least 
to global statistics, and is appropriate for “big data” settings where the whole 
data cannot be explored at once. On the negative side, this approach has the same 
degree of validation as variational Bayes methods [20], which means converging 
to a proxy posterior that is at best optimally close to the genuine posterior within 
a certain class, requires a meaningful decomposition of the likelihood into blocks 
which can be simulated, calls for the determination of several tolerance levels, is 
critically dependent on calibration choices, has no self-control safety mechanism 
and requires identifiability of the models’ underlying parameters. Hence, while 
EP-ABC can be considered for conducting model selection, there is no theoreti¬ 
cal guarantee that it provides a converging approximation of the evidence, while 
the implementation on realistic models in population genetics seems out of reach. 

3. THE CURSE OF INSUFFICIENCY 

The paper [28] issued a warning that ABC approximations to posterior prob¬ 
abilities cannot always be trusted in the double sense that (a) they stand away 
from the genuine posterior probabilities (imprecision) and (b) they may even fail 
to converge to a Dirac distribution on the true model as the size of the observed 
dataset grows to infinity (inconsistency). Approximating posterior probabilities 
via an ABC algorithm means using the frequencies of acceptances of simulations 
from each of those models. We assumed in Algorithm 1 the use of a common 
summary statistic (vector) to define the distance to the observations as otherwise 
the comparison between models would not make sense. This point may sound 
anticlimactic since the same feature occurs for point estimation, where the ABC 
estimator is an estimate of E[0|S'(y°*’®)]. Indeed, all ABC approximations rely on 
the posterior distributions knowing those summary statistics, rather than know¬ 
ing the whole dataset. When conducting point estimation with insufficient statis¬ 
tics, the information content is necessarily degraded. The posterior distribution 
is then different from the true posterior but, at least, gathering more observa¬ 
tions brings more information about the parameter (and convergence when the 
number of observations goes to infinity), unless one uses only ancillary statis¬ 
tics. However, while this information impoverishment only has consequences in 
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terms of the precision of the inference for most inferential purposes, it induces a 
dramatic arbitrariness in the construction of the Bayes factor. To illustrate this 
arbitrariness, consider the case of starting from a statistic S{x) sufficient for both 
models. Then, by the factorisation theorem, the true likelihoods factorise as 

fi{x\9) = gi{x)TTi{S{x)\9) and f2{x\9) = g2ix)TT2{S{x)\9) 
resulting in a true Bayes factor equal to 

( 1 ) 

92{x) 

where the last term, indexed by the summary statistic S', is the limiting (or 
Monte Carlo error-free) version of the ABC Bayes factor. In the more usual case 
where the user cannot resort to a sufficient statistic, the ABC Bayes factor may 
diverge one way or another as the number of observations increases. A notable 
exception is the case of Gibbs random fields where [18] have shown how to derive 
inter-model sufficient statistics, beyond the raw sample. This is related to the 
less pessimistic paper of [13], also concerned with the limiting behaviour for the 
ratio (1). Indeed, these authors reach the opposite conclusion from ours, namely 
that the problem can be solved by a sufficiency argument. Their point is that, 
when comparing models within exponential families (which is the natural realm 
for sufficient statistics), it is always possible to build an encompassing model with 
a sufficient statistic that remains sufficient across models. 

However, apart from examples where a tractable sufficient summary statistic 
is identified, one cannot easily compute a sufficient summary statistic for model 
choice and this results in a loss of information, when compared with the exact 
inferential approach, hence a wider discrepancy between the exact Bayes factor 
and the quantity produced by an ABC approximation. When realising this con¬ 
ceptual difficulty, the authors of [28] felt it was their duty to warn the community 
about the dangers of this approximation, especially when considering the rapidly 
increasing number of applications using ABC for conducting model choice or hy¬ 
pothesis testing. Another argument in favour of this warning is that it is often 
difficult in practice to design a summary statistic that is informative about the 
model. 

Let us signal here that a summary selection approach purposely geared towards 
model selection can be found in [2]. Let us stress in and for this section that the 
said method similarly suffers from the above curse of dimensionality. Indeed, the 
approach therein is based on an estimate of Fisher’s information contained in the 
summary statistics about the pair (9II, 9) and the correlated search for a subset 
of those summary statistics that is (nearly) sufficient. As explained in the paper, 
this approach implies that the resulting summary statistics are also sufficient for 
parameter estimation within each model, which obviously induces a dimension 
inflation in the dimension of the resulting statistic, in opposition to approaches 
focussing solely on the selection of summary statistics for model choice, like [23] 
and [9]. 

We must also stress that, from a model choice perspective, the vector made of 
the (exact!) posterior probabilities of the different models obviously constitutes a 
Bayesian sufficient statistics of dimension M —1, but this vector is intractable pre¬ 
cisely in cases where the user has to resort to ABC approximations. Nevertheless, 
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Importance Sampling estimates of P(M=1 |y) 


Fig 1. Comparison of importance sampling (first axis) and ABC (second axis) estimates of the 
posterior probability of scenario 1 in the first population genetic experiment, using summary 
statistics. (Source: [28]) 


this remark is exploited in [23] in a two-stage ABC algorithm. The second stage of 
the algorithm is ABC model choice with summary statistics equal to approxima¬ 
tion of the posterior probabilities. Those approximations are computed as ABC 
solutions at the first stage of the algorithm. Despite the conceptual attractiveness 
of this approach, which relies on a genuine sufficiency result, the approximation 
of the posterior probabilities given by the first stage of the algorithm directly rely 
on the choice of a particular set of summary statistics, which brings us back to 
the original issue of trusting an ABC approximation of a posterior probability. 

There therefore is a strict loss of information in using ABC model choice, due 
to the call both to insufficient statistics and to non-zero tolerances (or a imperfect 
recovery of the posterior probabilities with a regression procedure). 

3.1 Some counter-examples 

Besides a toy example opposing Poisson and Geometric distributions to point 
out the potential irrelevance of the Bayes factor based on poor statistics, [28] 
goes over a realistic population genetic illustration, where two evolution scenar¬ 
ios involving three populations are compared, two of those populations having 
diverged 100 generations ago and the third one resulting from a recent admixture 
between the hrst two populations (scenario 1) or simply diverging from popu¬ 
lation 1 (scenario 2) at the same date of 5 generations in the past. In scenario 
1, the admixture rate is 0.7 from population 1. Simulated datasets (100) of the 
same size as in experiment 1 (15 diploid individuals per population, 5 indepen¬ 
dent micro-satellite loci) were generated assuming an effective population size of 
1000 and a mutation rate of 0.0005. In this experiment, there are six parame¬ 
ters (provided with the corresponding priors): the admixture rate (^[0.1, 0.9]), 
three effective population sizes (Z^[200, 2000]), the time of admixture/second di¬ 
vergence 10]) and the date of the first divergence (Z7[50, 500]). While costly 
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Fig 2. Comparison of the range of the ABC posterior probability that data is from a normal 
model with unknown mean 9 when the data is made of n — 10,100,1000 observations (left, 
centre, right, resp.) either from a Gaussian (lighter) or Laplace distribution (darker) and when 
the ABC summary statistic is made of the empirical mean, median, and variance. The ABC 
algorithm generates 10'* simulations (5,000 for each model) from the prior 9 ~ ^^(0, 4) and 
selects the tolerance e as the 1% distance quantile over those simulations. (Source: [21].) 


in computing time, the posterior probability of a scenario can be estimated by 
importance sampling, based on 1000 parameter values and 1000 trees per param¬ 
eter value, thanks to the modules of [31]. The ABC approximation is produced 
by DIYABC [11], based on a reference sample of two million parameters and 24 
summary statistics. The result of this experiment is shown on Figure 1, with a 
clear divergence in the numerical values despite stability in both approximations. 
Taking the importance sampling approximation as the reference value, the error 
rates in using the ABC approximation to choose between scenarios 1 and 2 are 
14.5% and 12.5% (under scenarios 1 and 2), respectively. Although a simpler 
experiment with a single parameter and the same 24 summary statistics shows 
a reasonable agreement between both approximations, this result comes as an 
additional support to our warning about a blind use of ABC for model selection. 
The corresponding simulation experiment was quite intense, as, with 50 markers 
and 100 individuals, the product likelihood suffers from an enormous variability 
that 100,000 particles and 100 trees per locus have trouble addressing despite a 
huge computing cost. 

An example is provided in the introduction of the paper [21], sequel to [28]. 
The setting is one of a comparison between a normal y ~ M{6i, 1) model and 
a double exponential y ~ £(02,l/\/2) model^. The summary statistics used in 
the corresponding ABC algorithm are the sample mean, the sample median and 
the sample variance. Figure 2 exhibits the absence of discrimination between 
both models, since the posterior probability of the normal model converges to a 
central value around 0.5-0.6 when the sample size grows, irrelevant of the true 
model behind the simulated datasets. 

3.2 Still some theoretical guarantees 

Our answer to the (well-received) above warning is provided in [21], which deals 
with the evaluation of summary statistics for Bayesian model choice. The main 
result states that, under some Bayesian asymptotic assumptions, ABC model 
selection only depends on the behaviour of the mean of the summary statistic 

*The double exponential distribution is also called the Laplace distribution, hence the nota¬ 
tion jC( 92, Ify/T), with mean 62 and variance one. 
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n=10 n=100 



Gauss Laplace Gauss Laplace 


n=1000 



Gauss Laplace 


Fig 3. Same representation as Figure 2 when using the median absolute deviation of the sample 
as its sole summary statistic. (Source: [21].) 


under both models. The paper establishes a theoretical framework that leads 
to demonstrate consistency of the ABC Bayes factor under the constraint that 
the ranges of the expected value of the summary statistic under both models do 
not intersect. An negative result is also given in [21], which mainly states that, 
whatever the observed dataset, the ABC Bayes factor selects the model having 
the smallest effective dimension when the assumptions do not hold. 

The simulations associated with the paper were straightforward in that (a) the 
setup compares normal and Laplace distributions with different summary statis¬ 
tics (inc. the median absolute deviation), (b) the theoretical results told what to 
look for, and (c) they did very clearly exhibit the consistency and inconsistency 
of the Bayes factor/posterior probability predicted by the theory. Both boxplots 
shown here on Figures 2 and 3 show this agreement: when using (empirical) 
mean, median, and variance to compare normal and Laplace models, the poste¬ 
rior probabilities do not select the true model but instead aggregate near a fixed 
value. When using instead the median absolute deviation as summary statistic, 
the posterior probabilities concentrate near one or zero depending on whether or 
not the normal model is the true model. 

It may be objected to such necessary and sufficient conditions that Bayes fac¬ 
tors simply are inappropriate for conducting model choice, thus making the whole 
derivation irrelevant. This foundational perspective is an arguable viewpoint [15]. 
However, it can be countered within the Bayesian paradygm by the fact that 
Bayes factors and posterior probabilities are consistent quantities that are used 
in conjunction with ABC in dozens of genetic papers. Further arguments are pro¬ 
vided in the various replies to both of Templetons radical criticisms [34, 35]. That 
more empirical and model-based assessments also are available is quite correct, 
as demonstrated in the multicriterion approach of [26]. This is simply another 
approach, not followed by most geneticists so far. 

A concluding remark about [21] is that, while the main bulk of the paper 
is theoretical, it does bring an answer that the mean ranges of the summary 
statistic under each model must not intersect if they are to be used for ABC 
model choice. In addition, while the theoretical assumptions therein are not of 
the utmost relevance for statistical practice, the paper includes recommendations 
on how to conduct a t^st on the difference of the means of a given summary 
statistics under both models, towards assessing whether or not this summary is 
acceptable. 






















9 


4. SELECTING THE MAP MODEL VIA MACHINE LEARNING 

The above sections provide enough arguments to feel less than confident in 
the outcome of a standard ABC model choice algorithm 1, at least in the nu¬ 
merical approximation of the probabilities = m|5'(Y) = and in their 
connection with the genuine posterior probabilities P(9Jl = m|Y = There 

are indeed three levels of approximation errors in such quantities, one due to 
the Monte Carlo variability, one due to the non-zero ABC tolerance or, more 
generally to the error committed by the regression procedure when estimating 
the conditional expected value, and one due to the curse of insufficiency. While 
the derivation of a satisfying approximation of the genuine P(5[R = m|Y = 
seems beyond our reach, we present below a novel approach to both construct the 
most likely model and approximate P(9Jt = m|S'(Y) = for the most likely 
model, based on the machine learning tool of random forests. 

4.1 Reconsidering the posterior probability estimation 

Somewhat paradoxically, since the ABC approximation to posterior proba¬ 
bilities of a collection of models is delicate, [25] support inverting the order of 
selection of the a posteriori most probable model and of approximation of its 
posterior probability, using the alternative tool of random forests for both goals. 
The reason for this shift in order is that the rate of convergence of local regres¬ 
sion procedure such as A:-nn or the local regression with multinomial link heavily 
depends on the dimension of the covariates (here the dimension of the summary 
statistic). Thus, since the primary goal of ABC model choice is to select the most 
appropriate model, both [32] and [25] argue that one does not need to correctly 
approximating the probability 

= m|5(Y) « 5°’’") 

when looking for the most probable model in the sense of 

= m| Y = 

probability. [32] stresses that selecting the most adequate model for the data at 
hand as the maximum a posteriori (MAP) model index is a classification issue, 
which proves to be a significantly easier inference problem than estimating a re¬ 
gression function [12, 19]. This is the reason why [32] adapt the above Algorithm 1 
by resorting to a fc-nn classification procedure, which sums up as returning the 
most frequent (or majority rule) model index among the k simulations nearest to 
the observed dataset, nearest in the subspace of the summary statistics. Indeed, 
generic classification aims at forecasting a variable 911 taking a finite number of 
values, {1,..., M}, based on a vector of covariates S = (5i,..., Sd)- The Bayesian 
approach to classification stands in using a training database (m*, s*) made of in¬ 
dependent replicates of the pair (911, S'(Y)) that are simulated from the prior 
predictive distribution. The connection with ABC model choice is that the later 
predicts a model index, 911, from the summary statistic ^(Y). Simulations in the 
ABC reference table can thus be envisioned as creating a learning database from 
the prior predictive that trains the classifier. 

[25] widen the paradigm shift undertaken in [32], as they use a machine learning 
approach to the selection of the most adequate model for the data at hand and 
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exploit this tool to derive an approximation of the posterior probability of the 
selected model. The classification procedure chosen by [25] is the technique of 
Random Forests (RFs) [7], which constitutes a trustworthy and seasoned machine 
learning tool, well adapted to complex settings as those found in ABC settings. 
The approach further requires no primary selection of a small subset of summary 
statistics, which allows for an automatic input of summaries from various sources, 
including softwares like DIYABC [9]. At a first stage, a RF is constructed from 
the reference table to predict the model index and applied to the data at hand to 
return a MAP estimate. At a second stage, an additional RF is constructed for 
explaining the selection error of the MAP estimate, based on the same reference 
table. When applied to the observed data, this secondary random forest produces 
an estimate of the posterior probability of the model selected by the primary RF, 
as detailed below, following [25]. 

4.2 Random forests construction 

A RF aggregates a large number of classification trees by adding for each 
tree a randomisation step to the Classification And Regression Trees (CART) 
algorithm [8]. Let us recall that this algorithm produces a binary classification 
tree that partitions the covariate space towards a prediction of the model index. In 
this tree, each binary node is partitioning the observations via a rule of the form 
Sj < tj, where Sj is one of the summary statistics and tj is chosen towards the 
minimisation of an heterogeneity index. For instance, [25] uses the Gini criterion 
[19]. A CART tree is built based on a learning table and it is then applied to the 
observed summary statistic 5°*^®, predicting the model index by following a path 
that applies these binary rules starting from the tree root and returning the label 
of the tip at the end of the path. 

The randomisation part in RF produces a large number of distinct CART 
trees by (a) using for each tree a bootstrapped version of the learning table on 
a bootstrap sub-sample of size Woot and (b) selecting the summary statistics at 
each node from a random subset of the available summaries. The calibration of 
a RF thus involves three quantities: 

- B, the number of trees in the forest, 

- ntry, the number of covariates randomly sampled at each node by the ran¬ 
domised CART, and 

- A^boot) the size of the bootstraped sub-sample. 

The so-called out-of-bag error associated with an RF is the average number of 
times a point from the learning table is wrongly allocated, when averaged over 
trees that exclude this point from the bootstrap sample. 

The way [25] builds a random forest classifier given a collection of statistical 
models is to start from an ABC reference table including a set of simulation 
records made of model indices, parameter values and summary statistics for the 
associated simulated data. This table then serves as training database for a ran¬ 
dom forest that forecasts model index based on the summary statistics. 
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Algorithm 3 random forest ABC model choice 

Generate N samples as in Algorithm 1 (the reference table) 

Construct A^tree randomized CART which predict the model indices using the summary statistics 
for 6 = 1 to A'tree dO 

draw a bootstrap sub-sample of size A^boot from the reference table 
grow a randomized CART Ti, 

end for 

Determine the predicted indices for and the trees {Ti,; 6=1,..., A'tree} 

Assign to an indice (a model) according to a majority vote among the predicted indices 


4.3 Approximating the posterior probability of the MAP 

The posterior probability of a model is the natural Bayesian uncertainty quan¬ 
tification [27] since it is the complement of the posterior loss associated with a 
0-1 loss Irrrj^^r^obs) where is the model selection procedure, e.g., the RF 

outcome described in the above section. However, for reasons described above, 
we are unwilling to trust the standard ABC approximation to the posterior prob¬ 
ability as reported in Algorithm 1 . An initial proposal in [32] is to instead rely 
on the conditional error rate induced by the fc-nn classifier knowing ^(Y) = 
namely 

where 501 denotes the fc-nn classifier trained on ABC simulations. The above con¬ 
ditional expected value of is approximated in [32] with a Nadaraya- 

Watson estimator on a new set of simulations where the authors compare the 
model index which calibrates the simulation of the pseudo-data and the 
model index predicted by the k-nn approach trained on a first database 

of simulations. However, this first proposal has the major drawback of relying 
on nonparametric regression, which deteriorates when the dimension of the sum¬ 
mary statistic increases. This local error also allows for the selection of summary 
statistics adapted to but the procedure of [32] remains constrained by the 
dimension of the summary statistic, which typically have to be less than 10 . 

Furthermore, relying on a large dimensional summary statistic—to bypass, 
at least partially, the curse of insufficiency—was the main reason for adopting a 
classifier such as RFs in [25] . Hence the authors proposed to estimate the posterior 
expectation of ^ function of the summary statistics, via another 

RF construction. 

= 1 - P[5U1 = . 

The estimation of E[l 2 ^^^^^^|s] proceeds as follows; 

- compute the values of for the trained random forest and all terms 

in the reference table; 

- train a second RF regressing on the same set of summary statistics 

and the same reference table, producing a function g{s) that returns a 
machine learning estimate of P[911 7 ^ 5Ut(s)|s]; 

- apply this function to the actual observations to produce 1 — as an 

estimate of P[9JT = 
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5. A FIRST TOY EXAMPLE 

We consider in this section a simple unidimensional setting with three models 
where the marginal likelihoods can be computed in closed form. 

Under Model 1, our dataset is a n-sample from an Exponential distribution 
with parameter 9 (with expectation 1/0) and the corresponding prior distribution 
on 6 is an Exponential distribution with parameter 1. In this model, given the 
sample y = {yi,..., y^) with yi > 0, the marginal likelihood is given by 

/ ” \ 

mi{y) = r(n + 1) f 1 + ^ Vij 


Under Model 2, our dataset is a re-sample from a Log-Normal distribution 
with location parameter 9 and dispersion parameter equal to 1 (which implies 
an expectation equal to exp(0 -|- 0.5)). The prior distribution on 0 is a standard 
Gaussian distribution. For this model, given the sample y = {yi,... ,yn) with 
yt > 0, the marginal likelihood is given by 


m 2 iy) = exp 


log(y0 


V 2=1 


2 

l{2n{n + 1)) 


n 


2 

/2 




5^1og(2/*) /(2n)-5^ log(yi) 


^ 2=1 


2 = 1 
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Under Model 3, our dataset is a re-sample from a Gamma distribution with 
parameter (2,0) (with expectation 2/0) and the prior distribution on 0 is an 
Exponential distribution with parameter 1. For this model, given the sample 
y = {yi,..., yri)with y* > 0, the marginal likelihood is given by 


msiy) = exp 


n 

-2 = 1 


r(2re-h 1) 

r(2)" 



-2n-l 


We consider three summary statistics 


Yy^^Y i°s(2/i), Y 


V 2 = 1 


2 = 1 


2 = 1 


These summary statistics are sufficient not only within each model but also for 
the model choice problem [13] and the purpose of this example is not to evaluate 
the impact of a loss of sufficiency. 

When running ABC, we set re = 20 for the sample size and generated a refer¬ 
ence table containing 29,000 simulations (9676 simulations from model 1, 9650 
from model 2 and 9674 from model 3). We further generated an independent test 
dataset of size 1,000. Then, to calibrate the optimal number of neighbours in the 
standard ABC procedure [18, 37] we exploited 1,000 independent simulations. 

For each element of the test dataset, as obvious from the above rrej(y)’s we can 
evaluate the exact model posterior probabilities. Figure 4 represents the posterior 
probability of Model 3 for every simulation, ranked by model index. In addition. 
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Index in the test sample 


Fig 4. True posterior probability of Model 3 for each term from the test sample. Colour corre¬ 
sponds to the true model index: black for Model 1, red for Model 2 and green for Model 3. The 
terms in the test sample have been ordered by model index to improve the representation. 


Figure 5 gives a plot of the first two LDA projections of the test dataset. Both 
hgures explain why the model choice problem is not easy in this setting. Indeed, 
based on the exact posterior probabilities, selecting the model associated with 
the highest posterior probability achieves the smallest prior error rate. Based on 
the test dataset, we estimate this lower bound as being around 0.245, i.e., close 
to 25 %. 

Based on a calibration set of 1,000 simulations, and the above reference table 
of size 29,000, the optimal number of neighbours that should be used by the 
standard ABC model choice procedure, i.e., the one that minimises the prior 
error rate, is equal to 20. In this case, the resulting prior error rate for the test 
dataset is eqnal to 0.277. 

By comparison, the RF ABC model choice technique of [25] based on 500 trees 
achieves an error rate of 0.276 on the test dataset. For this example, adding the 
two LDA components to the summary statistics does not make a difference. This 
alternative procedure achieves similarly good results in terms of prior error rate, 
since 0.276 is relatively closed to the absolute lower bound of 0.245. However, as 
explained in previous sections and illustrated on Figure 6, the RF estimates of 
the posterior probabilities are not to be trusted. In short, a classification tool is 
not necessarily appropriate for regression goals. 

A noteworthy featnre of the RF technique is its ability to be robust against 
non-discriminant variates. This obviously is of considerable appeal in ABC model 
choice since the selection of snmmary statistics is an nnsolved challenge. To illus¬ 
trate this point, we added to the original set of three summary statistics variables 
that are pure noise, being produced by independent simulations from standard 
Ganssian distributions. Table 1 shows that the additional error due to those irrel- 
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Fig 6. True posterior probabilities of Model 1 against their Random Forest estimates for the 
test sample, with the same colour code as in Figure 4- 
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Extra variables 

prior error rate 

0 

0.276 

2 

0.283 

4 

0.288 

6 

0.272 

8 

0.280 

10 

0.286 

20 

0.318 

50 

0.355 

100 

0.391 

200 

0.419 

1000 

0.456 


Table 1 

Evolution of the prior error rate for the RF ABC model choice procedure as a function of the 

number of white noise variates. 


Extra variables 

optimal k 

prior error rate 

0 

20 

0.277 

2 

20 

0.368 

4 

140 

0.468 

6 

200 

0.491 

8 

260 

0.492 

10 

260 

0.526 

20 

260 

0.542 

50 

260 

0.548 

100 

500 

0.559 

200 

500 

0.572 

1000 

1000 

0.594 


Table 2 

Evolution of the prior error rate for a standard ABC model choice as a function of the number 

of white noise variates. 

evant variates grows much more slowly than for the standard ABC model choice 
technique, as shown in Table 2. In the latter case, a few extraneous variates suffice 
to propel the error rate above 50 %. 

6. HUMAN POPULATION GENETICS EXAMPLE 

We consider here the massive Single Nucleotide Polymorphism (SNP) dataset 
already studied in [25], associated with a MRCA population genetic model corre¬ 
sponding to Kingman’s coalescent that has been at the core of ABC implementa¬ 
tions from their beginning [33] . The dataset corresponds to individuals originating 
from four Human populations, with 30 individuals per population. The freely ac¬ 
cessible public 1000 Genome databases -|-http://www. 1000genomes.org/data has 
been used to produce this dataset. As detailed in [25] one of the appeals of using 
SNP data from the 1000 Genomes Project [1] is that such data does not suffer 
from any ascertainment bias. 

The four Human populations in this study included the Yoruba population 
(Nigeria) as representative of Africa, the Han Chinese population (China) as 
representative of East Asia (encoded CHB), the British population (England and 
Scotland) as representative of Europe (encoded GBR), and the population of 
Americans of African ancestry in SW USA (encoded ASW). After applying some 
selection criteria described in [25], the dataset includes 51,250 SNP loci scattered 
over the 22 autosomes with a median distance between two consecutive SNPs 
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Fig 7. Two scenarios of evolution of four Human populations genotyped at 50,000 SNPs. The 
genotyped populations are YRI = Yoruba (Nigeria, Africa), CHB = Han (China, East Asia), 
GBR = British (England and Scotland, Europe), and ASW = Americans of African ancestry 
(SW USA). 


equal to 7 kb. Among those, 50,000 were randomly chosen for evaluating the 
proposed RF ABC model choice method. 

In the novel study described here, we only consider two scenarios of evolution. 
These two models differ by the possibility or impossibility of a recent genetic 
admixture of Americans of African ancestry in SW USA between their African 
forebears and individuals of European origins, as described in Figure 7. Model 2 
thus includes a single out-of-Africa colonisation event giving an ancestral out-of- 
Africa population with a secondarily split into one European and one East Asian 
population lineage and a recent genetic admixture of Americans of African origin 
with their African ancestors and European individuals. RF ABC model choice 
is used to discriminate among both models and returns error rates. The vector 
of summary statistics is the entire collection provided by the DIYABC software 
for SNP markers [91, made of 112 summary statistics described in the manual of 
DIYABC. 

Model 1 involves 16 parameters while Model 2 has an extra parameter, the 
admixture rate r^. All times and durations in the model are expressed in number 
of generations. The stable effective populations sizes are expressed in number of 
diploid individuals. The prior distributions on the parameters appearing in one 
of the two models and used to generate SNP datasets are as follows: 

1. split or admixture time ti, Z7[l,30], 

2 . split times uniform on their support 

{{t2,h,U) G [100,10000]®3|i2 <f3 <t4}, 

3. admixture rate (proportion of genes with a non-African origin in Model 2) 
Ta ~~i7[0.05,0.95], 

4. effective population sizes Ni, N 2 , Y 3 , Y 3 and A^ 34 , ^[1000,100000], 

5. bottleneck durations ds, and ds^, Z^[5,500], 

6 . bottleneck effective population sizes Nbns, Nbn^ and Nbri'^, ZY[5,500], 

7. ancestral effective population size Na, ^[100,10000], 

For the analyses we use a reference table containing 19995 simulations: 10032 
from Model 1 and 9963 from Model 2. Figure 9 shows the distributions of the first 
EDA projection for both models, as a byproduct of the simulated reference table. 
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Unsurprisingly, this LDA component has a massive impact on the RF ABC model 
choice procedure. When including the LDA statistic, most trees (473 out of 500) 
allocate the observed dataset to Model 2. The second random forest to evaluate 
the local selection error leads a high confidence level: the estimated posterior 
probability of Model 2 is greater than 0.999. Figure 8 shows contributions for the 
most relevant statistics in the forest, stressing once again the primary role of the 
first LDA axis. Note that using solely this first LDA axis increases considerably 
the prior error rate. 


7. CONCLUSION 

This chapter has presented a solution for conducting ABC model choice and 
testing that differs from the usual practice in applied fields like population ge¬ 
netics, where the use of Algorithm 1 remains the norm. This choice is not due 
to any desire to promote our own work, but proceeds from a genuine belief that 
the figures returned by this algorithm cannot be trusted as approximating the 
actual posterior probabilities of the model. This belief is based on our experience 
along the years we worked on this problem, as illustrated by the evolution in our 
papers on the topic. 

To move to a machine-learning tool like random forests somehow represents a 
paradigm shift for the ABC community. For one thing, to gather intuition about 
the intrinsic nature of this tool and to relate it to ABC schemes is certainly any¬ 
thing but straightforward. For instance, a natural perception of this classification 
methodology is to take it as a natural selection tool that could lead to a reduced 
subset of significant statistics, with the side appeal of providing a natural dis¬ 
tance between two vectors of summary statistics through the tree discrepancies. 
However, as we observed through experiments, subsequent ABC model choice 
steps based on the selected summaries are detrimental to the quality of the clas¬ 
sification once a model is selected by the random forest. The statistical appeal 
of a random forest is on the opposite that it is quite robust to the inclusion of 
poorly informative or irrelevant summary statistics and on the opposite able to 
catch minute amounts of additional information produced by such additions. 

While the current state-of-the-art remains silent about acceptable approxima¬ 
tions of the true posterior probability of a model, in the sense of being conditional 
to the raw data, we are nonetheless making progress towards the production of 
an approximation conditional on an arbitrary set of summary statistics, which 
should offer strong similarities with the above. That this step can be achieved at 
no significant extra-cost is encouraging for the future. 

Another important inferential issue pertaining ABC model choice is to test a 
large collection of models. The difficulties to learn how to discriminate between 
models certainly increase when the number of likelihoods in competition gets 
larger. Even the most up-to-date machine learning algorithms will loose their 
efficiency if one keeps constant the number of iid draws from each model, with¬ 
out mentioning that the time complexity will increase linearly with the size of 
the collection to produce the reference table that trains the classifier. Thus this 
problem remains largely open. 
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Fig 8. Contributions of the most frequent statistics in the RF. The contribution of a summary 
statistic is evaluated as the average decrease in node impurity at all nodes where it is selected, 
over the trees of the RF when using the 112 summary statistics (top) and when further adding 
the first LDA axis (bottom). 
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10 


Fig 9. Distribution of the first LDA axis derived from the reference table, in red for Model 1 
and in blue for Model 2. The observed dataset is indicated by a black vertical line. 
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